NumPy: computing with arrays

Numerical Python (NumPy) is the fundamental package for high performance scientific computing and data analysis. It is the founding for most high-level tools. Features:

  • ndarray: fast and space-efficient multidimensional array providing vectorized arithmetic operations
  • Standard mathematical functions for fast operations on entire arrays of data without having to write loops
  • tools for reading / writing array data to disk and working with memory-mapped files
  • Linear algebra, random number generation, and Fourier transform capabilities
  • tools for integrating C, C++, Fortran codes

For most data analysis applications, the main functionalities we might be interested in are:

  • fast vectorized array operations: data munging, cleaning, subsetting, filtering, transformation
  • sorting algorithm, set operations
  • efficient descriptive statistics, aggregating and summarizing data
  • data alignement and relational data manipulation for merging heterogeneous data sets
  • expressing conditional logic as array expressions
  • group-wise data manipulations

NumPy provides the computational foundation for these operations but you might want to use pandas as your basis for data analysis as it provides a rich, high-level interface masking most common data tasks very concise and simple. It also provides some more domain specific functionality like time series manipulation, absent in NumPy.

Data types for ndarrays

dtype is a special object containing the information the ndarrays needs to interpret a chunk of memory as a particular type of data:

arr = np.array([1,2,3], dtype=np.float64)

dtype objects are part of what makes NumPy so powerful and flexible. In most of the cases they map directly onto an underlying machine representation. This makes it easy to read/write binary streams of data to disk and connect to code written in languages such as C or Fortran. The most common numerical types are floatXX and intXX where XX stands for the number of bits (tipically 64 for float and 32 for int).

An array originally created with one type can be converter or casted to another type:


In [ ]:
import numpy as np
arr = np.array([1,2,3])
print(arr," is of type ",arr.dtype)
float_arr = arr.astype(np.float64)
print(float_arr," is of type ",float_arr.dtype)

However, pay attention to the fact that a astype() produces a new object.

Numbers introduced as strings can also be casted to numerical values:


In [ ]:
numeric_string = np.array(input("Type a sequence of numbers separated by spaces: ").split())

In [ ]:
numeric_string.astype(float)

One array can be casted to another array's dtype:


In [ ]:
int_array = np.arange(10)
other_array = np.array([0.5, 1.2], dtype=float)
int_array.astype(other_array.dtype)

Some reminders on indexing and slicing

Indexing and slicing in NumPy is a topic that could fill pages and pages. A single array element or a subset can be selected in multiple ways. The following will be a short reminder on array indexing and slicing in NumPy. In the case of one-dimensional arrays there are apparently no differences with respect to indexing in Python lists:


In [ ]:
vec = np.arange(10)*3 + 0.5
vec

We obtain the sixth element of the array similarly to what one would do in a list:


In [ ]:
vec[5]

The subset formed by the sixth, seventh, and eigth element:


In [ ]:
vec[5:8]  #as in lists, first index is included, last index is omitted

We can modify the value of a single element:


In [ ]:
vec[5] = 11
vec

Or a subset of the vector:


In [ ]:
vec [:3] = [11, 12, 13]
vec

If an scalar is assigned to a slice of the array, the value is broadcasted:


In [ ]:
vec [-3:] = 66
vec

READ THIS An importat difference with respect to lists is that slices of a given array are views on the original array. This means that the data is not copied and any modification will be reflected in the source array:


In [ ]:
print("This is the vector",vec)
slice = vec[5:8] #we obtain a slice of the vector
slice[1] = 1111 #modify the second element in the slice
print("This is the vector",vec)

This can be quite surprising and even inconvenient for those coming from programming language such as Fortran90 that copy data more zealously. If you want to actually copy a slice of an ndarray instead of just viewing it, you can either create a new array or use the copy() method:

slice = np.array(vec[5:8])
slice = vec[5:8].copy()

With higher dimensional arrays more possibilities are available. For instance, in the case of a 2D array, the elements at each index are no longer scalars but rather one-dimensional arrays. Similarly in the case of a 3D array, at each index we would have a 2D array composed of 1D arrays at each subindex, and so on.


In [ ]:
arr2D = np.arange(9).reshape(3,3)
arr2D[2]

Individual elements can be accessed recursively. You can use either the syntax employed in lists or the more compact and nice comma-separated list of indices to select individual elements.


In [ ]:
arr2D[2][1]

In [ ]:
arr2D[2,1]

In the case of multidimensional arrays, omitting later indices will produce lower-dimensional array consisting of all the data along the higher dimensions:


In [ ]:
arr3D = np.arange(27).reshape(3,3,3)
arr3D

Check that, for instance, arr3D[0] is a 3x3 matrix:


In [ ]:

Similarly, arr3D[2,1] gives us all the values of the three-indexed variable whose indices start with (1,0), forming a 1D array:


In [ ]:

Slices

Like lists, ndarrays can be sliced using the familiar syntax:


In [ ]:
vec[1:3]

Higher dimensional objects can be sliced along one or more axes and also mix integers. For instance in the 2D case:


In [ ]:
arr2D[:2]

The matrix is sliced along its first axis (rows) an the first two elements are given. One can pass multiple slices:


In [ ]:
arr2D[:2,1:]

Slice the first two rows and columns from the second onwards. Keep in mind that slices are views of the original array. The colon by itself means that the entire axis is taken:


In [ ]:
arr2D[:,-1] #produce the last column of the matrix

Boolean indexing

To illustrate this indexing style let us build a 7x5 matrix filled with random numbers distributed between 0 and 20.


In [ ]:
arr2D = np.random.randint(0,20,35).reshape(7,5)

A second sequence, a 1D array composed by 7 elements.


In [ ]:
arr1D = np.random.randn(7)

Suppose each value in arr1D corresponds to a row in the arr2D array and we want to select all the rows corresponding to absolute values in arr1D greater than 0.5. If we apply a logical operation onto the array the result is a NumPy array of the same shape with a boolean value according to the outcome of the logical operation on the corresponding element.


In [ ]:
np.abs(arr1D) > 0.5

This boolean array can be passed when indexing the array:


In [ ]:
arr2D[np.abs(arr1D) > 0.5]

Clearly, the boolean array must be of the same length as the axis it's indexing. Boolean arrays can be mixed with slices or integer indices:


In [ ]:
arr2D[np.abs(arr1D) > 0.5, -1] #last column of those rows matching the logical condition

Boolean conditions can be combined using logical operators like &(equivalent to and) and | (or)


In [ ]:
arr2D[(np.abs(arr1D) > 0.5) & (arr1D > 0)]

However, keep in mind that the Python keywords andand or do not work with boolean arrays.

We can also apply to the array a mask produced by a logical condition on the array itself. Which elements of arr2D are even?


In [ ]:

Fancy Indexing

Fancy indexing is a term adopted by NumPy to describe indexing using integer arrays.


In [ ]:
arr2D

In [ ]:
arr2D[[0,2]] #Fancy index is an array requesting rows 0 and 2

Using negative indices selects rows from the end:


In [ ]:
arr2D[[-1,-3]]

Passing multiple indices does something slightly different; it selects a 1D array of elements corresponding to each tuple of indices:


In [ ]:
arr2D[[0,2],[-1,-3]]

Check that we have selected elements (0,-1) and (2,-3):


In [ ]:

Transposing Arrays and Swapping Axes

Transposing is a special form of reshaping which returns a view on the underlying data without copying anything. Arrays have the transpose method and also the special T attribute:


In [ ]:
arr2D.shape

In [ ]:
arr2D.T.shape

In [ ]:
arr2D

In [ ]:
arr2D.T

For higher dimensional arrays, transpose will accept a tuple of axis numbers to permute the axes:


In [ ]:
arr = np.arange(16).reshape((2,2,4))
arr

In [ ]:
arr.transpose(1,0,2)

In [ ]:
arr[0]

In [ ]:
arr.transpose(1,0,2)[1]

Universal Functions: Fast Element-wise Array Funcions

A universal function (ufunc) is a function that performs elementwise operations on data stired in ndarrays. They can be seen as fast vectorized wrappers for simple functions that take an array of values and produce an array of results. Many such ufuncs are elementwise representations and are called unary ufuncs:


In [ ]:
arr1D = np.arange(10)
np.sqrt(arr1D)

Other functions take two arrays as arguments. They are called binary ufuncs and some examples are add or maximum.


In [ ]:
i = np.random.randint(8, size=8)
i

In [ ]:
j = np.random.randint(8, size=8)
j

In [ ]:
np.maximum(i,j) #produces element-wise maxima

See the following link for more on ufuncs.

Data Processing Using Arrays

Let us start with a simple example, we will evaluate the function $\sqrt{x^2 + y^2}$ on a grid of equally spaced $(x,y)$ values.


In [ ]:
points = np.arange(-5, 5, 0.01)
print(points.ndim, points.shape, points.size)

The np.mesgrid() function takes two 1D arrays and produces two 2D matrices corresponding to all pairs of (x, y) in the two arrays:


In [ ]:
import numpy as np
x, y = np.meshgrid(points, points)

In [ ]:
y

To evaluate the function we simply write the same expression we would write for two scalar values, only now x and y are arrays. Being of the same dimension, the operation is carried out element wise:


In [ ]:
zp = np.sqrt(x**2 + y**2)

In [ ]:
zp

In [ ]:
import matplotlib.pyplot as plt
%matplotlib inline

In [ ]:
plt.imshow(zp, cmap=plt.cm.gray)
plt.colorbar()
plt.title("Image plot of $\sqrt{x^2 + y^2}$ for a evenly spaced grid")

Expressing Conditional Logic as Array Operations

The numpy.where function is a vectorized vrsion of the ternary expression:

x if condition else y

To see how it works, let us build first two arrays 1D:


In [ ]:
xarr = np.linspace(1.1,1.5,5)
yarr = np.linspace(2.1,2.5,5)

We also have a boolean array of the same dimension, for instance:


In [ ]:
condition = np.array([True, False, True, True, False])

Let us build a new array combining xarr and yarr in such a way that whenever condition is True we use the value in xarr and yarr otherwise:


In [ ]:
new_arr = [x if c else y for x,y,c in zip(xarr, yarr, condition)]
new_arr

The latter is not the most efficient solution:

  • it will not be very fast for large arrays
  • it will not work if the arrays are multidimensional

The np.where function offers a concise and fast alternative:


In [ ]:
new_arr = np.where(condition, xarr, yarr)
new_arr

Exercise

  • Build a 4x4 array composed by randomly distributed values
  • Build a new 4x4 arrays that contains the value 2 where the original array has a positive value and -2 where the original array contains a negative one
  • Build another array setting only the positive elements to 2

In [ ]:
arr = np.random.randn(16).reshape(4,4)
arr

In [ ]:
new1 = np.where(arr > 0, 2, -2)
new1

In [ ]:
new2 = np.where(arr > 0, 2, arr)
new2

Mathematical and Statistical Methods

Numpy offers a set of mathematical functions that compute statistics about an entire array or about the data along an axis. These functions are accessed as array methods. Reductions or aggregations such as sum, mean, and std can be used by calling the array method or the Numpy function:


In [ ]:
import numpy as np
arr = np.random.randn(5,4)

In [ ]:
arr

In [ ]:
arr.mean()

In [ ]:
np.mean(arr)

Functions like mean and sum take an optional axisargument which computes the statistic over the given axis, yielding an array with one fewer dimension


In [ ]:
arr.mean(axis=1)

In [ ]:
arr.mean(1)

Other methods like cumsum and cumprod do not aggregate but they produce an array of the intermediate results:


In [ ]:
arr = np.arange(9).reshape(3,3)
arr

In [ ]:
arr.cumsum() #cumulative sum of the complete matrix

In [ ]:
arr.cumsum(0) #cumulative sum over columns

In [ ]:
arr.cumsum(1) #cumulative sum over rows

Methods for Boolean Arrays

Boolean take the values either 1 (True) or 0 (False) when applying statistical methods. For instance, we could count the True values in a boolean array:


In [ ]:
arr = np.random.randn(100)
(arr > 0).sum()

There are two additional methods that are particularly useful for boolean arrays: any and all. any tests whether one or more values in an array is True, while all checks if every value is True:


In [ ]:
np.random.randint?

In [ ]:
boolarr = np.array(np.random.randint(0,2,size=10),dtype=np.bool)
boolarr

In [ ]:
boolarr.any()

In [ ]:
boolarr.all()

By the way, these two methods can be applied to non-boolean arrays, non-zero elements evaluate to True.

Sorting

Numpy arrays can be sorted in-place using the sort method:


In [ ]:
arr = np.random.randn(10)
arr

In [ ]:
arr.sort()
arr

In the case of multidimensional arrays can have each 1D section of values sorted in-place along an axis.


In [ ]:
arr = np.random.randn(5,3) # 5x3 matrix with random values
arr

In [ ]:
arr.sort(1) #sort the rows 
arr

In [ ]:
arr.sort(0) #sort the columns 
arr

It is worthwhile mentioning that the np.sort() function is not fully equivalent to the sort method. While the method sorts the array in-place, the function creates a new function.

Exercise Write a piece code that:

  • builds a 1D array of 1000 normally distributed random numbers
  • compute 10% quantile of the array (i.e. the last value in the top 10% values)

In [ ]:
arr = np.random.randn(5)
arr.sort()
quantile = 0.1
arr[int(quantile*len(arr))]

Linear Algebra

Keep in mind that multiplying two arrays with * results in an element-wise product instead of a vectorial (or matrix) dot product. For that, NumPy provides both an array method and a function:


In [ ]:
x = np.linspace(1,6,6).reshape(2,3)
y = np.random.randint(0,30,size=6).reshape(3,2)

In [ ]:
x

In [ ]:
y

In [ ]:
x.dot(y)

The numpy.linalg module of the NumPy module has a set of matrix decomposition, inverse, and determinant functions. These are implemented under the hood using the same industry-standard Fortran libraries used in other languages like MATLAB and R, such as BLAS, LAPACK or Intel MKL:


In [ ]:
from numpy.linalg import inv, qr

In [ ]:
X = np.random.randn(5,5)

In [ ]:
mat = X.T.dot(X) #Matrix product between X and its transpose

In [ ]:
inv(mat)

In [ ]:
mat.dot(inv(mat))

Type qr?:


In [ ]:


In [ ]:
q, r = qr(mat)

See Numpy Documentation for the list of numpy.linalg functions